Movie Revenue Prediction - UMLND Capstone Project

Objective of this project is to predict a movie revenue based on historic data about movie revenues and performance at global box office.

Such a prediction would be useful for optimization in many areas during various stages of planning and production of movies, for instance, selection of actors, crew, location, production spend, marketing spend, logistics and so on. Predicting revenue potential would enable movie production enterprises make wise investment decisions, come up with movies with plots relevant to society, higher entertainment satisfaction and ultimately greater good of all the involved parties.

Data Exploration

In this public competition hosted by Kaggle, I am presented with metadata on past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

Data is downloaded from this link https://www.kaggle.com/c/tmdb-box-office-prediction/data

Since the dataset is small enough, I made a local copy and checked in to git along with project.

In [1]:
!conda env list
# conda environments:
#
airflow                  /home/srini/.conda/envs/airflow
capstone                 /home/srini/.conda/envs/capstone
dog-project              /home/srini/.conda/envs/dog-project
mlflow                   /home/srini/.conda/envs/mlflow
quadcop                  /home/srini/.conda/envs/quadcop
rstudio                  /home/srini/.conda/envs/rstudio
base                  *  /opt/anaconda3

In [2]:
# Import usual modules to get going, basic visualiation, date, Json, progress bar, ....
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
from tqdm import tqdm
from datetime import datetime
import time
import json

# Import required sklearn modules
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Import main algorithm modules
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

# Follwing modules are for visualizations
from wordcloud import WordCloud
import eli5
import shap
shap.initjs()

warnings.filterwarnings("ignore")
%matplotlib inline

Read training dataset and explore to get familiarized with data

Review data structure, get a feel for data types and null values

In [3]:
train = pd.read_csv('data/train.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
id                       3000 non-null int64
belongs_to_collection    604 non-null object
budget                   3000 non-null int64
genres                   2993 non-null object
homepage                 946 non-null object
imdb_id                  3000 non-null object
original_language        3000 non-null object
original_title           3000 non-null object
overview                 2992 non-null object
popularity               3000 non-null float64
poster_path              2999 non-null object
production_companies     2844 non-null object
production_countries     2945 non-null object
release_date             3000 non-null object
runtime                  2998 non-null float64
spoken_languages         2980 non-null object
status                   3000 non-null object
tagline                  2403 non-null object
title                    3000 non-null object
Keywords                 2724 non-null object
cast                     2987 non-null object
crew                     2984 non-null object
revenue                  3000 non-null int64
dtypes: float64(2), int64(3), object(18)
memory usage: 539.1+ KB

There are 3000 records in dataset, looking at data types and null values in few of the columns, it will be an interesting challenge with data exploration and exploratory data analysis.

Review test data set as well...

In [4]:
test = pd.read_csv('data/test.csv')
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4398 entries, 0 to 4397
Data columns (total 22 columns):
id                       4398 non-null int64
belongs_to_collection    877 non-null object
budget                   4398 non-null int64
genres                   4382 non-null object
homepage                 1420 non-null object
imdb_id                  4398 non-null object
original_language        4398 non-null object
original_title           4398 non-null object
overview                 4384 non-null object
popularity               4398 non-null float64
poster_path              4397 non-null object
production_companies     4140 non-null object
production_countries     4296 non-null object
release_date             4397 non-null object
runtime                  4394 non-null float64
spoken_languages         4356 non-null object
status                   4396 non-null object
tagline                  3535 non-null object
title                    4395 non-null object
Keywords                 4005 non-null object
cast                     4385 non-null object
crew                     4376 non-null object
dtypes: float64(2), int64(2), object(18)
memory usage: 756.0+ KB

Exploratory data analysis (EDA)

Let us jump in to some exploratory data analysis, take a look at few records to get a feel of actual data.

We have some JSON key value pairs, free form text, categorical variables and continuous variables, as well as release date. Dream come true for aspiring ML Engineer to play with this kind of data!

In [5]:
train.head()
Out[5]:
id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity ... release_date runtime spoken_languages status tagline title Keywords cast crew revenue
0 1 [{'id': 313576, 'name': 'Hot Tub Time Machine ... 14000000 [{'id': 35, 'name': 'Comedy'}] NaN tt2637294 en Hot Tub Time Machine 2 When Lou, who has become the "father of the In... 6.575393 ... 2/20/15 93.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The Laws of Space and Time are About to be Vio... Hot Tub Time Machine 2 [{'id': 4379, 'name': 'time travel'}, {'id': 9... [{'cast_id': 4, 'character': 'Lou', 'credit_id... [{'credit_id': '59ac067c92514107af02c8c8', 'de... 12314651
1 2 [{'id': 107674, 'name': 'The Princess Diaries ... 40000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN tt0368933 en The Princess Diaries 2: Royal Engagement Mia Thermopolis is now a college graduate and ... 8.248895 ... 8/6/04 113.0 [{'iso_639_1': 'en', 'name': 'English'}] Released It can take a lifetime to find true love; she'... The Princess Diaries 2: Royal Engagement [{'id': 2505, 'name': 'coronation'}, {'id': 42... [{'cast_id': 1, 'character': 'Mia Thermopolis'... [{'credit_id': '52fe43fe9251416c7502563d', 'de... 95149435
2 3 NaN 3300000 [{'id': 18, 'name': 'Drama'}] http://sonyclassics.com/whiplash/ tt2582802 en Whiplash Under the direction of a ruthless instructor, ... 64.299990 ... 10/10/14 105.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The road to greatness can take you to the edge. Whiplash [{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n... [{'cast_id': 5, 'character': 'Andrew Neimann',... [{'credit_id': '54d5356ec3a3683ba0000039', 'de... 13092000
3 4 NaN 1200000 [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n... http://kahaanithefilm.com/ tt1821480 hi Kahaani Vidya Bagchi (Vidya Balan) arrives in Kolkata ... 3.174936 ... 3/9/12 122.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released NaN Kahaani [{'id': 10092, 'name': 'mystery'}, {'id': 1054... [{'cast_id': 1, 'character': 'Vidya Bagchi', '... [{'credit_id': '52fe48779251416c9108d6eb', 'de... 16000000
4 5 NaN 0 [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam... NaN tt1380152 ko 마린보이 Marine Boy is the story of a former national s... 1.148070 ... 2/5/09 118.0 [{'iso_639_1': 'ko', 'name': '한국어/조선말'}] Released NaN Marine Boy NaN [{'cast_id': 3, 'character': 'Chun-soo', 'cred... [{'credit_id': '52fe464b9251416c75073b43', 'de... 3923970

5 rows × 23 columns

Descriptive stats in train and test datasets.

In [6]:
train.describe(include='all')
Out[6]:
id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity ... release_date runtime spoken_languages status tagline title Keywords cast crew revenue
count 3000.000000 604 3.000000e+03 2993 946 3000 3000 3000 2992 3000.000000 ... 3000 2998.000000 2980 3000 2403 3000 2724 2987 2984 3.000000e+03
unique NaN 422 NaN 872 941 3000 36 2975 2992 NaN ... 2398 NaN 401 2 2400 2969 2648 2975 2984 NaN
top NaN [{'id': 645, 'name': 'James Bond Collection', ... NaN [{'id': 18, 'name': 'Drama'}] http://www.transformersmovie.com/ tt0780571 en Hot Pursuit Hitchcock follows the relationship between dir... NaN ... 9/10/15 NaN [{'iso_639_1': 'en', 'name': 'English'}] Released Based on a true story. Lolita [{'id': 10183, 'name': 'independent film'}] [] [{'credit_id': '52fe436d9251416c7500fe95', 'de... NaN
freq NaN 16 NaN 266 4 1 2575 2 1 NaN ... 5 NaN 1817 2996 3 2 27 13 1 NaN
mean 1500.500000 NaN 2.253133e+07 NaN NaN NaN NaN NaN NaN 8.463274 ... NaN 107.856571 NaN NaN NaN NaN NaN NaN NaN 6.672585e+07
std 866.169729 NaN 3.702609e+07 NaN NaN NaN NaN NaN NaN 12.104000 ... NaN 22.086434 NaN NaN NaN NaN NaN NaN NaN 1.375323e+08
min 1.000000 NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN 0.000001 ... NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 1.000000e+00
25% 750.750000 NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN 4.018053 ... NaN 94.000000 NaN NaN NaN NaN NaN NaN NaN 2.379808e+06
50% 1500.500000 NaN 8.000000e+06 NaN NaN NaN NaN NaN NaN 7.374861 ... NaN 104.000000 NaN NaN NaN NaN NaN NaN NaN 1.680707e+07
75% 2250.250000 NaN 2.900000e+07 NaN NaN NaN NaN NaN NaN 10.890983 ... NaN 118.000000 NaN NaN NaN NaN NaN NaN NaN 6.891920e+07
max 3000.000000 NaN 3.800000e+08 NaN NaN NaN NaN NaN NaN 294.337037 ... NaN 338.000000 NaN NaN NaN NaN NaN NaN NaN 1.519558e+09

11 rows × 23 columns

In [7]:
test.describe(include='all')
Out[7]:
id belongs_to_collection budget genres homepage imdb_id original_language original_title overview popularity ... production_countries release_date runtime spoken_languages status tagline title Keywords cast crew
count 4398.000000 877 4.398000e+03 4382 1420 4398 4398 4398 4384 4398.000000 ... 4296 4397 4394.000000 4356 4396 3535 4395 4005 4385 4376
unique NaN 556 NaN 1101 1402 4398 39 4353 4383 NaN ... 458 3289 NaN 526 3 3529 4342 3885 4365 4376
top NaN [{'id': 645, 'name': 'James Bond Collection', ... NaN [{'id': 18, 'name': 'Drama'}] http://www.workandtheglory.com/ tt3722070 en The Rookie No overview found. NaN ... [{'iso_3166_1': 'US', 'name': 'United States o... 9/9/11 NaN [{'iso_639_1': 'en', 'name': 'English'}] Released Be careful what you wish for. Cinderella [{'id': 187056, 'name': 'woman director'}] [] [{'credit_id': '52fe43f1c3a368484e007033', 'de...
freq NaN 10 NaN 348 3 1 3776 2 2 NaN ... 2587 7 NaN 2704 4389 2 2 30 21 1
mean 5199.500000 NaN 2.264929e+07 NaN NaN NaN NaN NaN NaN 8.550230 ... NaN NaN 107.622212 NaN NaN NaN NaN NaN NaN NaN
std 1269.737571 NaN 3.689991e+07 NaN NaN NaN NaN NaN NaN 12.209014 ... NaN NaN 21.058290 NaN NaN NaN NaN NaN NaN NaN
min 3001.000000 NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN 0.000001 ... NaN NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN
25% 4100.250000 NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN 3.895186 ... NaN NaN 94.000000 NaN NaN NaN NaN NaN NaN NaN
50% 5199.500000 NaN 7.450000e+06 NaN NaN NaN NaN NaN NaN 7.482241 ... NaN NaN 104.000000 NaN NaN NaN NaN NaN NaN NaN
75% 6298.750000 NaN 2.800000e+07 NaN NaN NaN NaN NaN NaN 10.938524 ... NaN NaN 118.000000 NaN NaN NaN NaN NaN NaN NaN
max 7398.000000 NaN 2.600000e+08 NaN NaN NaN NaN NaN NaN 547.488298 ... NaN NaN 320.000000 NaN NaN NaN NaN NaN NaN NaN

11 rows × 22 columns

Missing value analysis

Count of missing values in each column, in both train and test datasets. Gives an early indication on which columns need appropriate handling for missing values and strategies for handling them.

In [8]:
train.isna().sum()
Out[8]:
id                          0
belongs_to_collection    2396
budget                      0
genres                      7
homepage                 2054
imdb_id                     0
original_language           0
original_title              0
overview                    8
popularity                  0
poster_path                 1
production_companies      156
production_countries       55
release_date                0
runtime                     2
spoken_languages           20
status                      0
tagline                   597
title                       0
Keywords                  276
cast                       13
crew                       16
revenue                     0
dtype: int64
In [9]:
test.isna().sum()
Out[9]:
id                          0
belongs_to_collection    3521
budget                      0
genres                     16
homepage                 2978
imdb_id                     0
original_language           0
original_title              0
overview                   14
popularity                  0
poster_path                 1
production_companies      258
production_countries      102
release_date                1
runtime                     4
spoken_languages           42
status                      2
tagline                   863
title                       3
Keywords                  393
cast                       13
crew                       22
dtype: int64

Data Profiling

Using pandas_profiling to accelerate data exploration. For details on this module -> https://pandas-profiling.github.io/pandas-profiling/docs/

And a good article discussing advantages of using such package for data science efforts -> https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136

In [10]:
import pandas_profiling
train.profile_report(style={'full_width':True})
Out[10]:

Visualizations to develop intuition

Joint plots to start visualizing relationships between numeric variables...

Budget vs Revenue

In [11]:
sns.jointplot(x="budget", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()

Popularity vs Revenue

In [12]:
sns.jointplot(x="popularity", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()

Runtime vs Revenue

In [13]:
sns.jointplot(x="runtime", y="revenue", data=train, height=10, ratio=5, color="g")
plt.show()

Revenue Distribution

Majority of movies are near the zero line, 75% of them under 68 Million. So, we need to apply log scaling to make the data distribution conducive to analysis/modeling

In [14]:
sns.distplot(train.revenue)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f45ad7b80f0>
In [15]:
train.revenue.describe()
Out[15]:
count    3.000000e+03
mean     6.672585e+07
std      1.375323e+08
min      1.000000e+00
25%      2.379808e+06
50%      1.680707e+07
75%      6.891920e+07
max      1.519558e+09
Name: revenue, dtype: float64

Apply log1p function to take care of any odd cases with 0 revenues breaking application of log

In [16]:
train['logRevenue'] = np.log1p(train['revenue'])
sns.distplot(train['logRevenue'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f45ae9b4da0>

Release date exploration

In [17]:
train['release_date'].head()
Out[17]:
0     2/20/15
1      8/6/04
2    10/10/14
3      3/9/12
4      2/5/09
Name: release_date, dtype: object

Since only last two digits of the year are provided, let us manipulate and get correct year

In [18]:
# Split release date to individual parts, input is just a string sepearated by /
train[['release_month','release_day','release_year']]=train['release_date'].str.split('/',expand=True).replace(np.nan, -1).astype(int)
In [19]:
train['release_year'].max()
Out[19]:
99
In [20]:
# this might sound like odd logic, but that is best we can do, as we are in 2019!
train.loc[ (train['release_year'] <= 19) & (train['release_year'] < 100) , "release_year"] += 2000
train.loc[ (train['release_year'] > 19) & (train['release_year'] < 100) , "release_year"] += 1900

Additional Features from Release Date

  • Day of the week of release
  • quarter of release
In [21]:
# Make new features from date, day of week and quarter...
releaseDate = pd.to_datetime(train['release_date']) 
train['release_dayofweek'] = releaseDate.dt.dayofweek
train['release_quarter'] = releaseDate.dt.quarter

Plot Release Year Count - shows significant uptick of movie releases from 1980's onwards, makes sense as society becomes more affluent, with noticeable dips due to economical conditions.

In [22]:
plt.figure(figsize=(20,10))
sns.countplot(train['release_year'].sort_values())
plt.title("Movie Release counts by year")
loc, labels = plt.xticks()
plt.xticks(rotation=90)
plt.show()

Plot Release Month Counts Shows month wise variations in release counts, interesting observation that late summer and early fall has maximum number of releases.

In [23]:
plt.figure(figsize=(20,10))
sns.countplot(train['release_month'].sort_values())
plt.title("Movie Release counts by Month")
loc, labels = plt.xticks()
loc, labels = loc, ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
plt.xticks(loc, labels)
plt.show()

Plot Release Day Counts Mostly stable, First day of the month and middle of the month have many releases - Does it match with pay patterns of twice a month payments for salaried moviegoers? May be, but would be a tangent for problem at hand.

In [24]:
plt.figure(figsize=(20,10))
sns.countplot(train['release_day'].sort_values())
plt.title("Release Day Count")
plt.xticks()
plt.show()

Plot Release Day of Week Perfect, matches with intuition here, majority of movies are released on Fridays' followed by Thursdays' as moviegoers are eager to spend sometime on their weekend for entertainment.

In [25]:
plt.figure(figsize=(20,10))
sns.countplot(train['release_dayofweek'].sort_values())
plt.title("Movies released on day of week")
loc, labels = plt.xticks()
loc, labels = loc, ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plt.xticks(loc, labels)
plt.show()

Plot Release quarter Counts No surprises here, matches with monthly trends, Summer and Fall have higher releases compared to other two quarters.

In [26]:
plt.figure(figsize=(20,10))
sns.countplot(train['release_quarter'].sort_values())
plt.title("Movies released in quarter")
plt.show()

Plot Release Year vs Revenue Good, matches with intuition and upward trend with movie release counts from 80's following valleys due to economic fluctuations.

In [27]:
# Using Pandas groupby method to get mean revenues by year.
train['meanRevenueByYear'] = train.groupby("release_year")["revenue"].aggregate('mean')
train['meanRevenueByYear'].plot(figsize=(20,10),color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Year")
plt.show()

Release Month vs Revenue Plot Interesting observation here, mean revenue peaks at June, and hits bottom during September (when maximum movies are released). Supply and demand?!

In [28]:
# Using Pandas groupby method to get mean revenues by release month.
train['meanRevenueByMonth'] = train.groupby("release_month")["revenue"].aggregate('mean')
train['meanRevenueByMonth'].plot(figsize=(20,10),color="r")
plt.xlabel("Release Month")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue Release Month")
plt.show()

Release Day of Week vs Revenue Graph indicates that Wednesday releases have peak revenue compared to Friday and Sunday. However many movies are released on Friday followed by Thursday. This is not significant in our analysis as it would suggest.

In [29]:
# Using Pandas groupby method to get mean revenues by release day of the week.
train['meanRevenueByDayOfWeek'] = train.groupby("release_dayofweek")["revenue"].aggregate('mean')
train['meanRevenueByDayOfWeek'].plot(figsize=(20,10),color="r")
plt.xlabel("Day of Week")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue by Day of Week")
plt.show()

Movie Mean Runtime by year Visualizing average runtime of movies, we can see a trend movies were very long in early phases averaging 150 minutes, They hit a minimum during great recession, were bouncing around little less than 2 hour mark. It is interesting to mote that the average peaked just after II world war and again after Vietnam war? I think it is tangent exploring those aspects, it is more or less stable around 2 hours mark, let us move on...

In [30]:
# Using Pandas groupby method to get mean runtime by release year.
train['meanruntimeByYear'] = train.groupby("release_year")["runtime"].aggregate('mean')
train['meanruntimeByYear'].plot(figsize=(20,10),color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Runtime")
plt.title("Movie Mean Runtime by Year")
plt.show()

Mean popularity by Year General uptick in movie popularity, no surprises, except a huge rise in 2016

In [31]:
# Using Pandas groupby method to get mean popularity by release year.
train['meanPopularityByYear'] = train.groupby("release_year")["popularity"].aggregate('mean')
train['meanPopularityByYear'].plot(figsize=(20,15),color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Popularity")
plt.title("Movie Mean Popularity by Year")
plt.show()

Movie Mean Budget by Year General uptick in average budget spent on movies, matches with intuition on increased number of movies made over years. Matches with economic trends as well.

In [32]:
# Using Pandas groupby method to get mean budgets by release year.
train['meanBudgetByYear'] = train.groupby("release_year")["budget"].aggregate('mean')
train['meanBudgetByYear'].plot(figsize=(20,10),color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Budget")
plt.title("Movie Mean Budget by Year")
plt.show()

Count Genres in training dataset No surprises here, moviegoers go to movies to experience drama, data shows at least half of the movies are "Drama" genre. Followed by one third of them in "Comedy", one fourth in "Thriller" and "Action"

In [33]:
# Simple helper function to evaluate string which is used to build dict of uique Genres
def get_dictionary(s):
    try:
        d = eval(s)
    except:
        d = {}
    return d

train = train

# Expand out JSON string with Genres, one hot encode and sum on Genres columns. 
train['genres'] = train['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = train.genres.str.get_dummies(sep=',')
train = pd.concat([train, genres], axis=1, sort=False)
for col in genres:
    print(col, "Genres Movie : ",  genres[col].sum())
Action Genres Movie :  741
Adventure Genres Movie :  439
Animation Genres Movie :  141
Comedy Genres Movie :  1028
Crime Genres Movie :  469
Documentary Genres Movie :  87
Drama Genres Movie :  1531
Family Genres Movie :  260
Fantasy Genres Movie :  232
Foreign Genres Movie :  31
History Genres Movie :  132
Horror Genres Movie :  301
Music Genres Movie :  100
Mystery Genres Movie :  225
Romance Genres Movie :  571
Science Fiction Genres Movie :  290
TV Movie Genres Movie :  1
Thriller Genres Movie :  789
War Genres Movie :  100
Western Genres Movie :  43

Count Genres in test dataset A quick check on test dataset to see if there are any oddities on Genres column, matches with overall trends, no issues here.

In [34]:
test = test

# Expand out JSON string with Genres, one hot encode and sum on Genres columns. 
test['genres'] = test['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = test.genres.str.get_dummies(sep=',')
test = pd.concat([test, genres], axis=1, sort=False)
for col in genres:
    print(col, "Genres Movie : ",  genres[col].sum())
Action Genres Movie :  994
Adventure Genres Movie :  677
Animation Genres Movie :  241
Comedy Genres Movie :  1577
Crime Genres Movie :  615
Documentary Genres Movie :  134
Drama Genres Movie :  2145
Family Genres Movie :  415
Fantasy Genres Movie :  396
Foreign Genres Movie :  53
History Genres Movie :  163
Horror Genres Movie :  434
Music Genres Movie :  167
Mystery Genres Movie :  325
Romance Genres Movie :  864
Science Fiction Genres Movie :  454
Thriller Genres Movie :  1080
War Genres Movie :  143
Western Genres Movie :  74

Original Language Counts Most movies, > 80% are English probably due to popularity of English language and popularity of Hollywood movies, would be concerning to generalize the analysis and model we build to worldwide predictions, especially when movies are becoming popular worldwide and other languages are catching up quickly.

In [35]:
plt.figure(figsize=(20,10))
sns.countplot(train['original_language'].sort_values())
plt.title("Original Language Counts")
plt.show()

Status Analysis A quick peak at counts of "status" of movies

In [36]:
train['status'].value_counts()
Out[36]:
Released    2996
Rumored        4
Name: status, dtype: int64

Do we have revenue for 4 movies? how is it possible for movies not released yet? Something to consider while doing feature engineering on how to handle these exceptions.

In [37]:
train.loc[train['status'] == "Rumored"][['status','revenue']]
Out[37]:
status revenue
609 Rumored 273644
1007 Rumored 60
1216 Rumored 13418091
1618 Rumored 229000

How about in test dataset? 4389 movies released, 7 are yet to be released.

In [38]:
test['status'].value_counts()
Out[38]:
Released           4389
Post Production       5
Rumored               2
Name: status, dtype: int64

_Homepage analysis. How many movies have a homepage? Could be used as a binary feature.

In [39]:
train['has_homepage'] = 1
train.loc[pd.isnull(train['homepage']) ,"has_homepage"] = 0
plt.figure(figsize=(20,10))
sns.countplot(train['has_homepage'].sort_values())
plt.title("Has Homepage?")
plt.show()

Correlation between has_homepage and revenue From visualization, we can infer that movies with home page generally have higher revenues, they are more popular.

In [40]:
sns.catplot(x="has_homepage", y="revenue", data=train)
plt.title("Revenue of movies with/wihtout homepage")
Out[40]:
Text(0.5, 1.0, 'Revenue of movies with/wihtout homepage')

Does Movie has tagline? How does it affect revenue? Analysis indicates movies with Tagline are more popular and command more revenues.

In [41]:
train['isTaglineNA'] = 0
train.loc[pd.isnull(train['tagline']) ,"isTaglineNA"] = 1
sns.catplot(x="isTaglineNA", y="revenue", data=train)
plt.title('Revenue of movies with/without tagline');

Is movie title different? Some movies have same titles, analyzing effect of title being same and different with categorical plot and impact on revenue, we can conclude that if title is same movies command more revenue, probably effect of previous title success carrying over to new release with same title.

In [42]:
train['isTitleDifferent'] = 1
train.loc[ train['original_title'] == train['title'] ,"isTitleDifferent"] = 0 
sns.catplot(x="isTitleDifferent", y="revenue", data=train)
plt.title('Revenue of movies with single and multiple titles');

What is the effect of orginal language if it English on revenue? Agrees with analysis of Original Language Most of our training dataset is English movies.

In [43]:
train['isOriginalLanguageEng'] = 0 
train.loc[train['original_language'] == "en" ,"isOriginalLanguageEng"] = 1
sns.catplot(x="isOriginalLanguageEng", y="revenue", data=train)
plt.title('Revenue of movies when Original Language is English and Not English');

Additional features

We do have additional dataset with rating and total votes for movies. It would be interesting to augment given dataset with additional features, gives me an opportunity to explore joins, missing records due to joins, impute or decide on what to do with missing values. Keep building skills!

In [44]:
trainAdditionalFeatures = pd.read_csv('data/TrainAdditionalFeatures.csv')
testAdditionalFeatures = pd.read_csv('data/TestAdditionalFeatures.csv')

train = pd.merge(train, trainAdditionalFeatures, how='left', on=['imdb_id'])
test = pd.merge(test, testAdditionalFeatures, how='left', on=['imdb_id'])
In [45]:
trainAdditionalFeatures.head()
Out[45]:
imdb_id popularity2 rating totalVotes
0 tt0169547 16.217 8.0 6016.0
1 tt0119116 26.326 7.4 5862.0
2 tt0325980 28.244 7.7 11546.0
3 tt0266697 18.202 7.9 8638.0
4 tt0418763 9.653 6.6 1201.0

Missing value analysis in additional datasets

In [46]:
train['rating'].isna().sum()
Out[46]:
118
In [47]:
train['totalVotes'].isna().sum()
Out[47]:
118
In [48]:
test['rating'].isna().sum()
Out[48]:
179
In [49]:
test['totalVotes'].isna().sum()
Out[49]:
179

Quite a few additional features have missing records. So, let us fill them with some reasonable defaults, 5 votes and a rating of 1.5, cautious defaulting to low values.

In [50]:
train['rating'] = train['rating'].fillna(1.5)
train['totalVotes'] = train['totalVotes'].fillna(5)

test['rating'] = test['rating'].fillna(1.5)
test['totalVotes'] = test['totalVotes'].fillna(5)

Training set Rating Visualization Apart from the defaulting for missing values, data follows roughly normal distribution, great!

In [51]:
plt.figure(figsize=(20,10))
sns.countplot(train['rating'].sort_values())
plt.title("Training dataset Rating Count")
plt.show()

Testing dataset rating visualization Roughly follows training dataset trends, no concerns on using this as feature.

In [52]:
plt.figure(figsize=(20,10))
sns.countplot(test['rating'].sort_values())
plt.title("Test dataset Rating Count")
plt.show()

Let us explore significance of rating on Mean Revenue in Training dataset. A rating of 6 has peak revenue, followed by 7 and 8, this would be good feature to predict on!

In [53]:
# Using Pandas groupby method to get mean revenue by ratings
train['meanRevenueByRating'] = train.groupby("rating")["revenue"].aggregate('mean')
train['meanRevenueByRating'].plot(figsize=(20,10),color="r")
plt.xlabel("Rating")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Rating")
plt.show()

Mean Revenue vs Total Votes visualization. Movies with votes in the range of 900 to 1600 tend to command good revenues

In [54]:
# Using Pandas groupby method to get mean revenue by total votes
train['meanRevenueByTotalVotes'] = train.groupby("totalVotes")["revenue"].aggregate('mean')
train['meanRevenueByTotalVotes'].plot(figsize=(20,10),color="r")
plt.xticks(np.arange(0,3000,1000))
plt.xlabel("Total Votes")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Total Votes")
plt.show()

trends of total votes over release year follow overall uptick with movie releases over year, gives confidence we could use this as predictor feature.

In [55]:
# Using Pandas groupby method to get mean total votes by release year
train['meantotalVotesByYear'] = train.groupby("release_year")["totalVotes"].aggregate('mean')
train['meantotalVotesByYear'].plot(figsize=(20,10),color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("TotalVotes")
plt.title("Movie Mean Total Votes By Release Year")
plt.show()

Total Votes vs Rating Visualization to analyze how they could potentially be interrelated.

In [56]:
# Using Pandas groupby method to get mean total votes by ratings
train['meanTotalVotesByRating'] = train.groupby("rating")["totalVotes"].aggregate('mean')
train['meanTotalVotesByRating'].plot(figsize=(20,10),color="r")
plt.xlabel("Rating")
plt.ylabel("Total Votes")
plt.title("Movie Mean Total Votes by Rating")
plt.show()

Word Cloud Visualization - Original Title

In [57]:
# Visualize a word Cloud for original title, to get a feel fo what words are important...

plt.figure(figsize = (12, 12))
text = ' '.join(train['original_title'].values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in titles')
plt.axis("off")
plt.show()

Word Cloud Visualization - Overview

In [58]:
# Visualize a word Cloud for Overview, to get a feel fo what words are important...

plt.figure(figsize = (12, 12))
text = ' '.join(train['overview'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in overview')
plt.axis("off")
plt.show()

Word Cloud Visualization - Tagline

In [59]:
# Visualize a word Cloud for Tagline, to get a feel fo what words are important...
plt.figure(figsize = (12, 12))
text = ' '.join(train['tagline'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in tagline')
plt.axis("off")
plt.show()

Let us do a Heat Map for correlation visualizations

As expected, total votes and revenue show very high correlation at 0.77, as highly voted movies perform well at box office, followed by budget and revenue at 0.75, which makes sense as well as highly popular movies which collect large revenues typically have huge budgets. Popularity takes up next spot at 0.46 correlation to revenue, runtime at 0.22. It is surprising to see rating is only at 0.17. Release year and release day of week and release month show very weak to very weak negative correlation.

In [60]:
# Heatmap visusalization to better understand cross corelations. So far we have been doing two at a time.

train_viz = train[['budget','rating','totalVotes','popularity','runtime','release_year','release_month','release_dayofweek','revenue']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(train_viz.corr(), annot=True)
plt.show()

With thorough understanding of data used, including additional features, let us tackle feature engineering. We could do this modern ML Engineering way with breaking down into modules, building tests etc., however, I chose to do this within notebook to stay focused on project and finishing it on time, instead of getting side tracked with beautification and production grade hardening of ML Feature engineering pipeline.

Feature Engineering

Putting together various data exploration activities, transformation snippets, handing JSON, modularizing where possible, wrote a data_pref function.

In [61]:
# Helper method to get dictionary by evaluating argument
def get_dictionary(s):
    try:
        d = eval(s)
    except:
        d = {}
    return d

# Helper method to get dict with counts for keys in a given list of JSON columns
def get_json_dict(df) :
    global json_cols
    result = dict()
    for e_col in json_cols :
        d = dict()
        rows = df[e_col].values
        for row in rows :
            if row is None : continue
            for i in row :
                if i['name'] not in d :
                    d[i['name']] = 0
                d[i['name']] += 1
        result[e_col] = d
    return result

# The main data preparation/feature engineering method
def data_prep(df):
    
    # release_date handling, splitting and taking care of double digit years.
    df[['release_month','release_day','release_year']]=df['release_date'].str.split('/',expand=True).replace(np.nan, 0).astype(int)
    df['release_year'] = df['release_year']
    
    # This logic with hardcoding of 19 may sound odd, that is best we could do with given data
    # we are hardcoding 19, the years we are in...
    df.loc[ (df['release_year'] <= 19) & (df['release_year'] < 100), "release_year"] += 2000
    df.loc[ (df['release_year'] > 19) & (df['release_year'] < 100), "release_year"] += 1900
    releaseDate = pd.to_datetime(df['release_date']) 
    df['release_dayofweek'] = releaseDate.dt.dayofweek 
    df['release_quarter'] = releaseDate.dt.quarter     
    
    # Missing value handling for rating and total votes, they come from additional features dataset.
    # as we have many missing values, following logic imputes yearly means and merges.
    rating_na = df.groupby(["release_year","original_language"])['rating'].mean().reset_index()
    df[df.rating.isna()]['rating'] = df.merge(rating_na, how = 'left' ,on = ["release_year","original_language"])
    vote_count_na = df.groupby(["release_year","original_language"])['totalVotes'].mean().reset_index()
    df[df.totalVotes.isna()]['totalVotes'] = df.merge(vote_count_na, how = 'left' ,on = ["release_year","original_language"])
    
    # default any NaN to default values of 1.5 for rating and 5 for total votes, they fall outside the valid
    # range, won't break algorithms like NaNs do!
    df['rating'] = df['rating'].fillna(1.5)
    df.loc[ (df['rating'] == 0 ), "rating"] += 1.5
    df['totalVotes'] = df['totalVotes'].fillna(5)
    df.loc[ (df['totalVotes'] == 0 ), "totalVotes"] += 1.5
    
    # Ranges of rating and total votes are order of magnitude different. Let us get some wieighted rating
    df['weightedRating'] = ( df['rating']*df['totalVotes'] + 6.367 * 1000 ) / ( df['totalVotes'] + 1000 )

    # Budget amount... adjust for infation and take log, to adjust for skewness
    df['originalBudget'] = df['budget']
    df['inflationBudget'] = df['budget'] + df['budget']*2.1/100*(2018-df['release_year']) #Inflation assuming simplistic 2.1% per year
    df['budget'] = np.log1p(df['inflationBudget']) 
    
    # Gender aggregations
    df['genders_0_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    df['genders_1_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    df['genders_2_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
  
    # small number of cases where runtime is 0... impute with mean runtime
    df.loc[ (df['runtime'] == 0 ), "runtime"] += df['runtime'].mean()  
    
    # Handle collections, perform label encoding, converting labels to numerical features
    df['_collection_name'] = df['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    le = LabelEncoder()
    le.fit(list(df['_collection_name'].fillna('')))
    df['_collection_name'] = le.transform(df['_collection_name'].fillna('').astype(str))
    
    # count number of keywords, number of cast
    df['_num_Keywords'] = df['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    df['_num_cast'] = df['cast'].apply(lambda x: len(x) if x != {} else 0)
    
    # compute mean poularity by release year
    df['_popularity_mean_year'] = df['popularity'] / df.groupby("release_year")["popularity"].transform('mean')
    
    # Compute few ratios with budget... to runtime, popularity, release year
    df['_budget_runtime_ratio'] = df['budget']/df['runtime'] 
    df['_budget_popularity_ratio'] = df['budget']/df['popularity']
    df['_budget_year_ratio'] = df['budget']/(df['release_year'])
    
    # Compute release year to poularity as well as inverse ratio
    df['_releaseYear_popularity_ratio'] = df['release_year']/df['popularity']
    df['_releaseYear_popularity_ratio2'] = df['popularity']/df['release_year']

    # More ratios of Popularity to total votes, ratings, release years
    df['_popularity_totalVotes_ratio'] = df['totalVotes']/df['popularity']
    df['_rating_popularity_ratio'] = df['rating']/df['popularity']
    df['_rating_totalVotes_ratio'] = df['totalVotes']/df['rating']
    df['_totalVotes_releaseYear_ratio'] = df['totalVotes']/df['release_year']
    
    # even more ratios of budget to rating, runtime to rating, budget by total votes
    df['_budget_rating_ratio'] = df['budget']/df['rating']
    df['_runtime_rating_ratio'] = df['runtime']/df['rating']
    df['_budget_totalVotes_ratio'] = df['budget']/df['totalVotes']
    
    # homepage - check for missing values
    df['has_homepage'] = 1
    df.loc[pd.isnull(df['homepage']) ,"has_homepage"] = 0
    
    # belongs to collection - check for missingvalues
    df['isbelongs_to_collectionNA'] = 0
    df.loc[pd.isnull(df['belongs_to_collection']) ,"isbelongs_to_collectionNA"] = 1
    
    # Tagline - check for missing values
    df['isTaglineNA'] = 0
    df.loc[df['tagline'] == 0 ,"isTaglineNA"] = 1 

    # Flag for original English language
    df['isOriginalLanguageEng'] = 0 
    df.loc[ df['original_language'] == "en" ,"isOriginalLanguageEng"] = 1
    
    # Flag for title change betwee original and current title
    df['isTitleDifferent'] = 1
    df.loc[ df['original_title'] == df['title'] ,"isTitleDifferent"] = 0 

    # Flag for movie release status
    df['isMovieReleased'] = 1
    df.loc[ df['status'] != "Released" ,"isMovieReleased"] = 0 

    # extract collection id from belongs_to_collection
    df['collection_id'] = df['belongs_to_collection'].apply(lambda x : np.nan if len(x)==0 else x[0]['id'])
    
    # get counts of letters/words for original title
    df['original_title_letter_count'] = df['original_title'].str.len() 
    df['original_title_word_count'] = df['original_title'].str.split().str.len()
    
    # get word count for title, overiew, tagline
    df['title_word_count'] = df['title'].str.split().str.len()
    df['overview_word_count'] = df['overview'].str.split().str.len()
    df['tagline_word_count'] = df['tagline'].str.split().str.len() 
    
    # get count of production countries and companies, crew and cast
    df['production_countries_count'] = df['production_countries'].apply(lambda x : len(x))
    df['production_companies_count'] = df['production_companies'].apply(lambda x : len(x))
    df['cast_count'] = df['cast'].apply(lambda x : len(x))
    df['crew_count'] = df['crew'].apply(lambda x : len(x))
    
    # get aggregates by release year, ratings vs votes
    df['meanruntimeByYear'] = df.groupby("release_year")["runtime"].aggregate('mean')
    df['meanPopularityByYear'] = df.groupby("release_year")["popularity"].aggregate('mean')
    df['meanBudgetByYear'] = df.groupby("release_year")["budget"].aggregate('mean')
    df['meantotalVotesByYear'] = df.groupby("release_year")["totalVotes"].aggregate('mean')
    df['meanTotalVotesByRating'] = df.groupby("rating")["totalVotes"].aggregate('mean')
    df['medianBudgetByYear'] = df.groupby("release_year")["budget"].aggregate('median')

    # For JSON columns, expand out and get counts per key
    for col in ['genres', 'production_countries', 'spoken_languages', 'production_companies']:
        df[col] = df[col].map(lambda x: sorted(list(set([n if n in train_dict[col] else col+'_etc' for n in [d['name'] for d in x]])))).map(lambda x: ','.join(map(str, x)))
        temp = df[col].str.get_dummies(sep=',')
        df = pd.concat([df, temp], axis=1, sort=False)
    # eliminate low frequency keys
    df.drop(['genres_etc'], axis = 1, inplace = True)
    
    # Drop source columns, as we have well engineered features ready to be modeled now!
    df = df.drop(['id', 'revenue','belongs_to_collection','genres','homepage','imdb_id','overview','runtime'
    ,'poster_path','production_companies','production_countries','release_date','spoken_languages'
    ,'status','title','Keywords','cast','crew','original_language','original_title','tagline', 'collection_id'
    ],axis=1)
    
    # replace any pesky NaNs still left with 0's, should not be any!
    df.fillna(value=0.0, inplace = True)

    return df

Read datasets, including (additional features)

In [62]:
# start of actual work, read both test and train datasets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# set dummy value in test dataset, it is for target, missing in test dataset.
test['revenue'] = np.nan

# Read additional feature datastes, merge with train and test datasets on IMDB_ID
train = pd.merge(train, pd.read_csv('data/TrainAdditionalFeatures.csv'), how='left', on=['imdb_id'])
test = pd.merge(test, pd.read_csv('data/TestAdditionalFeatures.csv'), how='left', on=['imdb_id'])

Quick sanity check on columns read and shape of dataframes

In [63]:
print(train.columns)
print(train.shape)
print(test.columns)
print(test.shape)
Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue',
       'popularity2', 'rating', 'totalVotes'],
      dtype='object')
(3000, 26)
Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue',
       'popularity2', 'rating', 'totalVotes'],
      dtype='object')
(4398, 26)

Data pre-processing

Handling JSON expansions, dropping low frequency values to optimize feature matrix.

In [64]:
# Apply log1p transformation on target, as it is skewed to the left.
train['revenue']= np.log1p(train['revenue'])
y = train['revenue'].values

# JSON specific processing, makes use of helper functions to build keys, value counts
json_cols = ['genres', 'belongs_to_collection', 'production_companies', 'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

for col in tqdm(json_cols) :
    train[col] = train[col].apply(lambda x : get_dictionary(x))
    test[col] = test[col].apply(lambda x : get_dictionary(x))

train_dict = get_json_dict(train)
test_dict = get_json_dict(test)

# remove categories with bias and low frequency
for col in json_cols :
    
    remove = []
    train_id = set(list(train_dict[col].keys()))
    test_id = set(list(test_dict[col].keys()))   
    
    remove += list(train_id - test_id) + list(test_id - train_id)
    for i in train_id.union(test_id) - set(remove) :
        if train_dict[col][i] < 10 or i == '' :
            remove += [i]
            
    for i in remove :
        if i in train_dict[col] :
            del train_dict[col][i]
        if i in test_dict[col] :
            del test_dict[col][i]
100%|██████████| 8/8 [00:08<00:00,  1.92s/it]

Feature Preparation

Merge both datasets, apply common feature engineering processing.

In [65]:
# Merge test and train datasets, perform feature engineering...
all_data = data_prep(pd.concat([train, test]).reset_index(drop = True))

#Split feature engineered data in to train and test datasets
train = all_data.loc[:train.shape[0] - 1,:]
test = all_data.loc[train.shape[0]:,:] 

Training

BenchMark Model

Decided to use a simple DecisionTreeRegressor with default arguments as the benchmark model. Other models' performance metrics will be compared to this model's performance for the chosen metric, minimizing RSMLE score for this regression prediction problem.

Let us split data in to train and validation with 80 to 20 ratio.

In [66]:
# A simple test and validation split with 80% trainig dataset and 20% validation dataset
# for benchmark model.
x_train, x_valid, y_train, y_valid = train_test_split(train,
                                                 y,
                                                 test_size=0.2,
                                                 random_state = 7)

# Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("validation set has {} samples.".format(x_valid.shape[0]))
Training set has 2400 samples.
validation set has 600 samples.

Train and predict for benchmark model, get RMSLE score

For benchmark model, RMSE is at 3.19, which serves as benchmark for further models to compare model prediction performance

In [67]:
# Fit regression model (benchmark model)
regr = DecisionTreeRegressor()

regr.fit(x_train, y_train)

# Predict
y_val_pred = regr.predict(x_valid)
In [68]:
# RMSLE on validation set
def rmsle(y_true, y_pred):
    return 'RMSLE', np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2))), False
print(rmsle(y_val_pred, y_valid))
print("Mean Squared Log Error for Benchmark Model: {}".format(np.sqrt(mean_squared_log_error( y_valid, y_val_pred ))))      
print("Mean Squared Error for Benchmark Model: {}".format(np.sqrt(mean_squared_error( y_valid, y_val_pred ))))
('RMSLE', 0.3247479099194811, False)
Mean Squared Log Error for Benchmark Model: 0.3247479099194811
Mean Squared Error for Benchmark Model: 3.221423328790665

Training with chosen libraries and default parameters

LightGBM

In [69]:
# with defatul values
# (boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, 
# subsample_for_bin=200000, objective=None, class_weight=None, 
# min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20,  
# subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, 
# random_state=None, n_jobs=-1, silent=True, importance_type='split', **kwargs)
core_params = {
    'objective': 'regression', # regression, multiclass, binary   
    'metric': 'rmse' # binary_logloss, mse, mae
}

modelilgb = lgb.LGBMRegressor(**core_params)
modelilgb.fit(x_train, y_train, 
        eval_set=[(x_train, y_train), (x_valid, y_valid)], eval_metric='rmse',
        verbose=1000, early_stopping_rounds=5)
Training until validation scores don't improve for 5 rounds.
Early stopping, best iteration is:
[50]	training's rmse: 1.18525	valid_1's rmse: 1.89299
Out[69]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              metric='rmse', min_child_samples=20, min_child_weight=0.001,
              min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0)

XGBoost

In [70]:
# Train XGBoost model with default parameters
# (params, dtrain, num_boost_round=10, evals=(), obj=None, 
# feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, 
# verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)

params = {'eta': 0.3, 
              'objective': 'reg:linear',
              'max_depth': 6,
              'subsample': 1,
              'colsample_bytree': 1,
              'eval_metric': 'rmse',
              'seed': 0,
              'silent': True}

train_data = xgb.DMatrix(data=x_train, label=y_train)
valid_data = xgb.DMatrix(data=x_valid, label=y_valid)

watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
modelixgb = xgb.train(dtrain=train_data, num_boost_round=10, evals=watchlist, early_stopping_rounds=20, verbose_eval=500, params=params)
[0]	train-rmse:11.1448	valid_data-rmse:11.2111
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 20 rounds.
[9]	train-rmse:1.45188	valid_data-rmse:2.0527

CatBoost

In [71]:
cat_params = {'learning_rate': 1,
              'depth': 2,
              'allow_writing_files': False}

modelicat = CatBoostRegressor(iterations=5, **cat_params)
modelicat.fit(x_train, y_train, eval_set=(x_valid, y_valid), cat_features=[], use_best_model=True, verbose=True)
0:	learn: 2.9721822	test: 2.9262945	best: 2.9262945 (0)	total: 53.3ms	remaining: 213ms
1:	learn: 2.5093793	test: 2.4681281	best: 2.4681281 (1)	total: 57ms	remaining: 85.5ms
2:	learn: 2.4056549	test: 2.3753032	best: 2.3753032 (2)	total: 59.1ms	remaining: 39.4ms
3:	learn: 2.3639493	test: 2.3334647	best: 2.3334647 (3)	total: 60.9ms	remaining: 15.2ms
4:	learn: 2.3449432	test: 2.2960843	best: 2.2960843 (4)	total: 63ms	remaining: 0us

bestTest = 2.296084343
bestIteration = 4

Out[71]:
<catboost.core.CatBoostRegressor at 0x7f457b4fa9b0>

Initial training attempt with LGBMRegressor

Tweaked params few times to get optimal RMSE, visualized results with eli5 and shap modules.

In [72]:
# Simple LightGBM Regressor to get a feel of training on 80/20 split training/vaidation data

params = {'num_leaves': 30,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 5,
         'learning_rate': 0.01,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.2,
         "verbosity": -1}
model1 = lgb.LGBMRegressor(**params, n_estimators = 20000, nthread = 4, n_jobs = -1)
model1.fit(x_train, y_train, 
        eval_set=[(x_train, y_train), (x_valid, y_valid)], eval_metric='rmse',
        verbose=1000, early_stopping_rounds=200)
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.30801	valid_1's rmse: 1.90093
Early stopping, best iteration is:
[1697]	training's rmse: 1.05842	valid_1's rmse: 1.89381
Out[72]:
LGBMRegressor(bagging_fraction=0.9, bagging_freq=1, bagging_seed=11,
              boosting='gbdt', boosting_type='gbdt', class_weight=None,
              colsample_bytree=1.0, feature_fraction=0.9,
              importance_type='split', lambda_l1=0.2, learning_rate=0.01,
              max_depth=5, metric='rmse', min_child_samples=20,
              min_child_weight=0.001, min_data_in_leaf=20, min_split_gain=0.0,
              n_estimators=20000, n_jobs=-1, nthread=4, num_leaves=30,
              objective='regression', random_state=None, reg_alpha=0.0,
              reg_lambda=0.0, silent=True, subsample=1.0,
              subsample_for_bin=200000, subsample_freq=0, verbosity=-1)
In [73]:
eli5.show_weights(model1, feature_filter=lambda x: x != '<BIAS>')
Out[73]:
Weight Feature
0.2258 _budget_year_ratio
0.1081 _rating_totalVotes_ratio
0.0690 budget
0.0670 release_year
0.0435 _popularity_mean_year
0.0366 totalVotes
0.0365 _totalVotes_releaseYear_ratio
0.0365 popularity2
0.0253 _runtime_rating_ratio
0.0249 _popularity_totalVotes_ratio
0.0173 rating
0.0164 release_day
0.0153 popularity
0.0151 originalBudget
0.0140 _num_cast
0.0139 weightedRating
0.0127 overview_word_count
0.0122 _budget_totalVotes_ratio
0.0120 _rating_popularity_ratio
0.0119 release_dayofweek
… 184 more …
In [74]:
# Using shap to visualize feature importances

explainer = shap.TreeExplainer(model1, x_train)
shap_values = explainer.shap_values(x_train)

shap.summary_plot(shap_values, x_train)
In [75]:
# Some more visualizatons with shap

top_columns = x_train.columns[np.argsort(shap_values.std(0))[::-1]][:10]
for col in top_columns:
    shap.dependence_plot(col, shap_values, x_train)

XGBoost, LightGBM, CatBoost models

For actual training, I intend to use XGBoost, LightGBM, and CatBoost models, with K-fold cross validation with 6 splits (k=6). Will use a blended error score

Setting up 6-fold validation with shuffling

In [76]:
# setting up for K-fold cross validation, with 6 folds.

random_seed = 200
n_fold = 6
folds = KFold(n_splits=n_fold, shuffle=True, random_state= random_seed)

Reusable train function, for training chosen algorithms

In [77]:
# Helper function to train various models
def train_model(X, X_test, y, params=None, folds=folds, model_type='lgb', plot_feature_importance=False, model=None):

    oof = np.zeros(X.shape[0])
    prediction = np.zeros(X_test.shape[0])
    scores = []
    feature_importance = pd.DataFrame()
    
    # Repeat for each fold of k-folds
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
        print('Fold', fold_n, 'started at', time.ctime())
        
        X_train, X_valid = X.values[train_index], X.values[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        
        if model_type == 'lgb':
            model = lgb.LGBMRegressor(**params, n_estimators = 20000, nthread = 4, n_jobs = -1)
            model.fit(X_train, y_train, 
                    eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
                    verbose=1000, early_stopping_rounds=200)
            
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test, num_iteration=model.best_iteration_)
            
        if model_type == 'xgb':
            train_data = xgb.DMatrix(data=X_train, label=y_train)
            valid_data = xgb.DMatrix(data=X_valid, label=y_valid)

            watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
            model = xgb.train(dtrain=train_data, num_boost_round=20000, evals=watchlist, early_stopping_rounds=200, verbose_eval=500, params=params)
            y_pred_valid = model.predict(xgb.DMatrix(X_valid), ntree_limit=model.best_ntree_limit)
            y_pred = model.predict(xgb.DMatrix(X_test.values), ntree_limit=model.best_ntree_limit)
            
        if model_type == 'cat':
            model = CatBoostRegressor(iterations=20000,  eval_metric='RMSE', **params)
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True, verbose=False)

            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test)
        
        oof[valid_index] = y_pred_valid.reshape(-1,)
        scores.append(mean_squared_error(y_valid, y_pred_valid) ** 0.5)
        
        prediction += y_pred    
        
        # accumulate feature imporatnces for LGB model
        if model_type == 'lgb':
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = X.columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_fold
    
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    
    if model_type == 'lgb':
        feature_importance["importance"] /= n_fold
        if plot_feature_importance:
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index

            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

            plt.figure(figsize=(16, 12));
            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False));
            plt.title('LGB Features (avg over folds)');
        
            return oof, prediction, feature_importance
        return oof, prediction
    
    else:
        return oof, prediction

Train LightGBM

In [78]:
# train LightGBM model

params = {'objective':'regression',
         'num_leaves' : 30,
         'min_data_in_leaf' : 20,
         'max_depth' : 9,
         'learning_rate': 0.004,
         'min_child_samples':100,
         'feature_fraction':0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9,
         'lambda_l1': 0.2,
         "bagging_seed": random_seed,
         "metric": 'rmse',
         'subsample':.8, 
         'colsample_bytree':.9,
         "random_state" : random_seed,
         "verbosity": -1}
oof_lgb, prediction_lgb, _ = train_model(train, test, y, params=params, model_type='lgb', 
                                         plot_feature_importance=True)
Fold 0 started at Tue Aug 27 20:21:22 2019
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.35149	valid_1's rmse: 2.04582
Early stopping, best iteration is:
[961]	training's rmse: 1.37031	valid_1's rmse: 2.04519
Fold 1 started at Tue Aug 27 20:21:26 2019
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.35755	valid_1's rmse: 2.04884
[2000]	training's rmse: 1.03973	valid_1's rmse: 2.0326
Early stopping, best iteration is:
[2346]	training's rmse: 0.955397	valid_1's rmse: 2.02943
Fold 2 started at Tue Aug 27 20:21:35 2019
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.356	valid_1's rmse: 2.11877
[2000]	training's rmse: 1.03922	valid_1's rmse: 2.10619
Early stopping, best iteration is:
[2143]	training's rmse: 1.00494	valid_1's rmse: 2.10475
Fold 3 started at Tue Aug 27 20:21:43 2019
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.34147	valid_1's rmse: 2.0729
[2000]	training's rmse: 1.00665	valid_1's rmse: 2.05665
Early stopping, best iteration is:
[2037]	training's rmse: 0.996273	valid_1's rmse: 2.05567
Fold 4 started at Tue Aug 27 20:21:51 2019
Training until validation scores don't improve for 200 rounds.
[1000]	training's rmse: 1.34784	valid_1's rmse: 1.97844
[2000]	training's rmse: 1.01383	valid_1's rmse: 1.94897
Early stopping, best iteration is:
[2281]	training's rmse: 0.946658	valid_1's rmse: 1.94684
Fold 5 started at Tue Aug 27 20:21:59 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[758]	training's rmse: 1.47472	valid_1's rmse: 2.03279
CV mean score: 2.0358, std: 0.0469.

Train XGBoost

In [79]:
# Train XGBoost model

xgb_params = {'eta': 0.01,
              'objective': 'reg:linear',
              'max_depth': 6,
              'subsample': 0.6,
              'colsample_bytree': 0.7,
              'eval_metric': 'rmse',
              'seed': 25,
              'silent': True}
oof_xgb, prediction_xgb = train_model(train, test, y, params=xgb_params, model_type='xgb')
Fold 0 started at Tue Aug 27 20:22:05 2019
[0]	train-rmse:15.6176	valid_data-rmse:15.5637
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.21579	valid_data-rmse:2.03961
[1000]	train-rmse:0.847146	valid_data-rmse:2.01592
Stopping. Best iteration:
[1102]	train-rmse:0.791398	valid_data-rmse:2.01326

Fold 1 started at Tue Aug 27 20:22:18 2019
[0]	train-rmse:15.5979	valid_data-rmse:15.6605
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.20542	valid_data-rmse:2.07111
[1000]	train-rmse:0.846431	valid_data-rmse:2.02535
[1500]	train-rmse:0.628602	valid_data-rmse:2.01456
Stopping. Best iteration:
[1764]	train-rmse:0.53903	valid_data-rmse:2.01324

Fold 2 started at Tue Aug 27 20:22:39 2019
[0]	train-rmse:15.5988	valid_data-rmse:15.6526
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.21615	valid_data-rmse:2.15336
[1000]	train-rmse:0.854758	valid_data-rmse:2.13988
[1500]	train-rmse:0.625917	valid_data-rmse:2.13631
Stopping. Best iteration:
[1421]	train-rmse:0.656551	valid_data-rmse:2.13458

Fold 3 started at Tue Aug 27 20:22:57 2019
[0]	train-rmse:15.6293	valid_data-rmse:15.5026
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.2077	valid_data-rmse:2.07787
[1000]	train-rmse:0.833905	valid_data-rmse:2.06492
Stopping. Best iteration:
[1039]	train-rmse:0.812116	valid_data-rmse:2.06346

Fold 4 started at Tue Aug 27 20:23:10 2019
[0]	train-rmse:15.6097	valid_data-rmse:15.5972
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.22087	valid_data-rmse:1.98408
[1000]	train-rmse:0.854067	valid_data-rmse:1.94143
[1500]	train-rmse:0.630586	valid_data-rmse:1.93233
Stopping. Best iteration:
[1513]	train-rmse:0.626682	valid_data-rmse:1.93184

Fold 5 started at Tue Aug 27 20:23:28 2019
[0]	train-rmse:15.5955	valid_data-rmse:15.6732
Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.

Will train until valid_data-rmse hasn't improved in 200 rounds.
[500]	train-rmse:1.21256	valid_data-rmse:2.07169
Stopping. Best iteration:
[723]	train-rmse:1.01404	valid_data-rmse:2.05937

CV mean score: 2.0360, std: 0.0618.

Train CatBoost

In [80]:
# Train CatBoost model

cat_params = {'learning_rate': 0.004,
              'depth': 5,
              'l2_leaf_reg': 10,
              'colsample_bylevel': 0.8,
              'bagging_temperature': 0.2,
              'od_type': 'Iter',
              'od_wait': 100,
              'random_seed': random_seed,
              'allow_writing_files': False}

oof_cat, prediction_cat = train_model(train, test, y, params=cat_params, model_type='cat')
Fold 0 started at Tue Aug 27 20:23:38 2019
Fold 1 started at Tue Aug 27 20:24:52 2019
Fold 2 started at Tue Aug 27 20:25:46 2019
Fold 3 started at Tue Aug 27 20:26:36 2019
Fold 4 started at Tue Aug 27 20:30:32 2019
Fold 5 started at Tue Aug 27 20:33:10 2019
CV mean score: 2.0140, std: 0.0752.

Build stack of all 3 models and train the stack

In [81]:
# prepare stack for stacked model training

train_stack = np.vstack([oof_lgb, oof_xgb, oof_cat]).transpose()
train_stack = pd.DataFrame(train_stack, columns=['lgb', 'xgb', 'cat'])
test_stack = np.vstack([prediction_lgb, prediction_xgb, prediction_cat]).transpose()
test_stack = pd.DataFrame(test_stack, columns=['lgb', 'xgb', 'cat'])
In [82]:
# Train stacked model

params = {'num_leaves': 8,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 3,
         'learning_rate': 0.01,
         "boosting": "gbdt",
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.2,
         "verbosity": -1}
oof_lgb_stack, prediction_lgb_stack, _ = train_model(train_stack, test_stack, y, 
                                                     params=params, model_type='lgb', plot_feature_importance=True)
Fold 0 started at Tue Aug 27 20:35:22 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[700]	training's rmse: 1.90865	valid_1's rmse: 2.00362
Fold 1 started at Tue Aug 27 20:35:23 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[289]	training's rmse: 1.95417	valid_1's rmse: 2.0395
Fold 2 started at Tue Aug 27 20:35:23 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[589]	training's rmse: 1.89452	valid_1's rmse: 2.11975
Fold 3 started at Tue Aug 27 20:35:23 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[294]	training's rmse: 1.95406	valid_1's rmse: 2.04168
Fold 4 started at Tue Aug 27 20:35:24 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[541]	training's rmse: 1.94317	valid_1's rmse: 1.96711
Fold 5 started at Tue Aug 27 20:35:24 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[196]	training's rmse: 1.99755	valid_1's rmse: 2.03931
CV mean score: 2.0352, std: 0.0463.

K-Fold cross validation on final blended model - 10 folds for validation

In [83]:
# setting up for K-fold cross validation, with 6 folds.

random_seed = 200
folds = 10
folds = KFold(n_splits=folds, shuffle=True, random_state= random_seed)
In [84]:
# Train stacked model

params = {'num_leaves': 8,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 3,
         'learning_rate': 0.01,
         "boosting": "gbdt",
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.2,
         "verbosity": -1}
oof_lgb_stack, prediction_lgb_stack, _ = train_model(train_stack, test_stack, y, 
                                                     params=params, model_type='lgb', plot_feature_importance=True)
Fold 0 started at Tue Aug 27 20:35:25 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[700]	training's rmse: 1.90865	valid_1's rmse: 2.00362
Fold 1 started at Tue Aug 27 20:35:25 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[289]	training's rmse: 1.95417	valid_1's rmse: 2.0395
Fold 2 started at Tue Aug 27 20:35:25 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[589]	training's rmse: 1.89452	valid_1's rmse: 2.11975
Fold 3 started at Tue Aug 27 20:35:25 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[294]	training's rmse: 1.95406	valid_1's rmse: 2.04168
Fold 4 started at Tue Aug 27 20:35:26 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[541]	training's rmse: 1.94317	valid_1's rmse: 1.96711
Fold 5 started at Tue Aug 27 20:35:26 2019
Training until validation scores don't improve for 200 rounds.
Early stopping, best iteration is:
[196]	training's rmse: 1.99755	valid_1's rmse: 2.03931
CV mean score: 2.0352, std: 0.0463.

Predictions

Now with training of the 3 chosen modeling techniques and stacked model with all 3 in-place, let us tackle predictions on test dataset, and see how closely they are matching with one another, and pick final model

In [85]:
# read submission dataset, call each of our models individually, blended, stacked...
# write outputs for each call, as well as all predictions

sub = pd.read_csv('data/sample_submission.csv')
all_pred = pd.read_csv('data/sample_submission.csv')

sub['revenue'] = np.expm1(prediction_lgb)
sub.to_csv("output/lgb.csv", index=False)
all_pred['pred_lgb'] = np.expm1(prediction_lgb)

sub['revenue'] = np.expm1(prediction_xgb)
sub.to_csv("output/xgb.csv", index=False)
all_pred['pred_xgb'] = np.expm1(prediction_xgb)
                          
sub['revenue'] = np.expm1(prediction_cat)
sub.to_csv("output/cat.csv", index=False)
all_pred['pred_cat'] = np.expm1(prediction_cat)
                          
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub.to_csv("output/blend_lgb_xgb.csv", index=False)
all_pred['pred_blend_lgb_xgb'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
                          
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
sub.to_csv("output/blend_all3.csv", index=False)
all_pred['pred_blend_all3'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)

sub['revenue'] = prediction_lgb_stack
sub.to_csv("output/stack_lgb.csv", index=False)

all_pred['pred_stack_lgb'] = np.expm1(prediction_lgb_stack)
all_pred.to_csv("output/all_predictions.csv", index=False)
In [86]:
all_pred.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4398 entries, 0 to 4397
Data columns (total 8 columns):
id                    4398 non-null int64
revenue               4398 non-null int64
pred_lgb              4398 non-null float64
pred_xgb              4398 non-null float64
pred_cat              4398 non-null float64
pred_blend_lgb_xgb    4398 non-null float64
pred_blend_all3       4398 non-null float64
pred_stack_lgb        4398 non-null float64
dtypes: float64(6), int64(2)
memory usage: 275.0 KB

Visualize correlation as HeatMap across various models

Visualizing the HeatMap, blend of all 3 models pred_blend_all3 is shows very high correlation, perfect correlation with 2 other models and 0.99 correlation with 2 other. I am going to pick that as my final model.

In [87]:
# Heatmap visualization for predictions across models

pred_viz = all_pred[['pred_lgb','pred_xgb','pred_cat','pred_blend_lgb_xgb','pred_blend_all3','pred_stack_lgb']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(pred_viz.corr(), annot=True)
plt.show()
In [88]:
# Scatter matrix visualization of predictions across models

pd.plotting.scatter_matrix(pred_viz, alpha = 0.5, figsize = (14,8), diagonal = 'kde');
In [ ]: